Loading packages

library(jtools) # for summarising regression models
library(tidyverse) # for ggplot2
library(car) # for variance inflation factor
library(plotly) # for making 3d plots
library(ggiraphExtra) # for making 2d multiple regression plots

Load dataset

# Load dataset 
quidditch <- read.delim("first_quidditch.txt")
head(quidditch)
colnames(quidditch)
## [1] "broom_type"     "student_weight" "broom_skills"   "flying_speed"
Explanation of dataset
  • Broom type - the make of each broom
  • Student weight - each student's weight in pounds
  • Broom skills - racing-broom skill score, ranging from 0 to 30
  • Flying speed - flying speed in kilometers per hour

Build a multiple regression model

model1 <- lm(flying_speed ~  student_weight + broom_skills, data = quidditch)

summary(model1)
## 
## Call:
## lm(formula = flying_speed ~ student_weight + broom_skills, data = quidditch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.20948 -0.43717  0.02847  0.42356  1.85770 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    26.954227   0.146793  183.62   <2e-16 ***
## student_weight -0.199242   0.001369 -145.59   <2e-16 ***
## broom_skills    0.176046   0.003546   49.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6553 on 495 degrees of freedom
## Multiple R-squared:  0.9793, Adjusted R-squared:  0.9792 
## F-statistic: 1.172e+04 on 2 and 495 DF,  p-value: < 2.2e-16

Creating clean summary table using jtools


summ(model1)
Observations: 498
Dependent variable: flying_speed
Type: OLS linear regression

F(2,495) = 11723.81
R² = 0.98
Adj. R² = 0.98

                  Est.   S.E.   t val.     p
(Intercept)      26.95   0.15   183.62  0.00
student_weight   -0.20   0.00  -145.59  0.00
broom_skills      0.18   0.00    49.64  0.00

Standard errors: OLS
Explanation of results
  • Y-intercept The intercept is 26.95, the predicted flying speed when both student weight and broom skills are zero. (Zero lies outside the range of the data here, so the intercept mainly anchors the regression surface.)

  • Estimates for the independent variables The estimates suggest that, holding broom skills constant, every one-pound increase in a student’s weight is associated with a 0.20 decrease in flying speed (kilometers per hour).
    Likewise, holding weight constant, every one-point increase in broom skills is associated with a 0.18 increase in flying speed (kilometers per hour).

  • P-value The p-values for both predictors are very small (reported as < 2e-16). Thus, we can reject the null hypothesis that each coefficient is zero and conclude that both student weight and broom skills are likely to influence flying speed.
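To make the coefficient interpretation concrete, the fitted equation can be used to predict a speed by hand and checked against predict(). The sketch below uses simulated data in the same shape as the quidditch dataset (the original file is not reproduced here, so the simulated coefficients will differ slightly from the table above); the rider's weight of 120 pounds and skill score of 15 are made-up example values.

```r
# Synthetic stand-in for the quidditch data (for illustration only)
set.seed(42)
n <- 498
sim <- data.frame(
  student_weight = runif(n, 80, 160),  # pounds
  broom_skills   = runif(n, 0, 30)     # skill score, 0-30
)
sim$flying_speed <- 26.95 - 0.20 * sim$student_weight +
  0.18 * sim$broom_skills + rnorm(n, sd = 0.65)

fit <- lm(flying_speed ~ student_weight + broom_skills, data = sim)

# Predicted speed for a hypothetical 120-pound student with skill score 15
new_rider <- data.frame(student_weight = 120, broom_skills = 15)
b <- coef(fit)
manual <- b["(Intercept)"] + b["student_weight"] * 120 + b["broom_skills"] * 15
auto   <- predict(fit, newdata = new_rider)

isTRUE(all.equal(unname(manual), unname(auto)))  # manual maths matches predict()
```

The manual sum of intercept plus weighted predictors reproduces predict() exactly, which is all "interpreting the estimates" amounts to.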

Visualising the results using a 3D plot with plotly

# simple 3D plot 

plot_ly(data = quidditch, z = ~flying_speed, x = ~student_weight, y = ~broom_skills, opacity = 0.5) %>%
  add_markers( marker = list(size = 5)) 

Visualising multiple regression model with two continuous predictor variables

p <- ggPredict(model1, interactive=TRUE)
p
p2 <- ggPredict(model1, interactive=FALSE)
p2

Assumption 1 - Linearity of the data

We can check the linearity assumption using the Residuals vs Fitted plot.

plot(model1, 1)

The residuals versus fitted plot should show no systematic pattern, and the line should be roughly horizontal at zero. That is the case here, which suggests a linear relationship between the predictors and the outcome variable.
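A numeric complement to the eyeball check is to refit the model with quadratic terms and ask whether they improve the fit; a non-significant F test is consistent with linearity. This is a sketch on simulated data standing in for the quidditch file (variable names w, s, y are placeholders).

```r
# Simulated data with a genuinely linear relationship
set.seed(1)
d <- data.frame(w = runif(200, 80, 160), s = runif(200, 0, 30))
d$y <- 27 - 0.20 * d$w + 0.18 * d$s + rnorm(200, sd = 0.65)

m  <- lm(y ~ w + s, data = d)
m2 <- update(m, . ~ . + I(w^2) + I(s^2))  # add curvature terms

anova(m, m2)  # non-significant F -> no evidence of curvature
```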

Assumption 2 - Homogeneity of variance (Homoscedasticity)

This is the assumption that the residuals have equal (or similar) variance across the range of fitted values.

We can check this assumption using the scale-location plot.

plot(model1, 3)

This plot should show residuals spread equally across the range of fitted values. Here the line is approximately horizontal and the points are evenly spread, which indicates homoscedasticity.
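The visual check can be backed up with a formal score test for non-constant variance, available in the already-loaded car package as ncvTest(). Sketched below on simulated data (a stand-in for model1, since the quidditch file is not reproduced here); a large p-value is consistent with constant variance.

```r
library(car)

# Simulated homoscedastic data standing in for the quidditch model
set.seed(1)
d <- data.frame(w = runif(200, 80, 160), s = runif(200, 0, 30))
d$y <- 27 - 0.20 * d$w + 0.18 * d$s + rnorm(200, sd = 0.65)
m <- lm(y ~ w + s, data = d)

# Score test of H0: constant error variance
ncvTest(m)
```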

Assumption 3 - Residuals are normally distributed

We can use the Q-Q plot of residuals to check the normality assumption. If the residuals are normally distributed, the points should fall approximately along the straight reference line.

plot(model1, 2)
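A common numeric companion to the Q-Q plot is the Shapiro-Wilk test on the residuals (base R's shapiro.test()); a large p-value gives no evidence against normality. Again sketched on simulated data standing in for model1.

```r
# Simulated data with normal errors, standing in for the quidditch model
set.seed(1)
d <- data.frame(w = runif(200, 80, 160), s = runif(200, 0, 30))
d$y <- 27 - 0.20 * d$w + 0.18 * d$s + rnorm(200, sd = 0.65)
m <- lm(y ~ w + s, data = d)

# Shapiro-Wilk test of H0: residuals are normally distributed
shapiro.test(residuals(m))
```

Note that with large samples this test can flag trivial departures from normality, so it should be read alongside the Q-Q plot rather than instead of it.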

Assumption 4 - No collinearity between predictors

In multiple regression, two or more predictor variables might be correlated with one another, which is known as multicollinearity.

We can assess for collinearity using the variance inflation factor score, which measures how much the variance of a regression coefficient is inflated due to multicollinearity.

car::vif(model1)
## student_weight   broom_skills 
##       1.000227       1.000227

The smallest possible value for VIF is 1, and a VIF that exceeds 5 (some authors use a cut-off of 10) suggests problematic collinearity.
In this case, the VIF is approximately 1, meaning that collinearity is not an issue between the two predictor variables.
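For a linear model, the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below verifies this by hand against car::vif() on simulated data (a stand-in for the quidditch model, with placeholder names w and s).

```r
library(car)

# Simulated stand-in for the quidditch model
set.seed(1)
d <- data.frame(w = runif(200, 80, 160), s = runif(200, 0, 30))
d$y <- 27 - 0.20 * d$w + 0.18 * d$s + rnorm(200, sd = 0.65)
m <- lm(y ~ w + s, data = d)

# Manual VIF for w: regress w on the other predictor(s), then 1 / (1 - R^2)
r2_w       <- summary(lm(w ~ s, data = d))$r.squared
vif_manual <- 1 / (1 - r2_w)
vif_car    <- vif(m)["w"]

isTRUE(all.equal(unname(vif_manual), unname(vif_car), tolerance = 1e-6))
```

Since w and s are generated independently here, R² is near zero and both VIFs come out close to 1, mirroring the quidditch result.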

Checking for outliers and influential values

We can identify outliers using the residuals vs leverage plot by examining the standardized residuals, where a standardized residual is the residual divided by its estimated standard error.

We want to look for points that fall outside the dashed lines, which mark Cook’s distance thresholds. Data points beyond those lines have high Cook’s distance values and may unduly influence the regression results.

# Residuals vs Leverage
plot(model1, 5)

In this case, our data do not appear to contain any influential points: the Cook’s distance thresholds (red dashed lines) do not even appear on the plot, because all data points fall well within them.
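The plot can also be checked numerically: cooks.distance() returns one influence score per observation, and a common rule of thumb flags points with a score above 4/n. A sketch on simulated data standing in for model1 (the 4/n cut-off is one convention among several):

```r
# Simulated stand-in for the quidditch model
set.seed(1)
d <- data.frame(w = runif(200, 80, 160), s = runif(200, 0, 30))
d$y <- 27 - 0.20 * d$w + 0.18 * d$s + rnorm(200, sd = 0.65)
m <- lm(y ~ w + s, data = d)

cd <- cooks.distance(m)      # one score per observation
max(cd)                      # largest influence score in the data
sum(cd > 4 / length(cd))     # how many points exceed the 4/n rule of thumb
```

With clean simulated data the maximum Cook's distance is small, matching the quidditch plot where no point approached the dashed lines.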

The end